Selection bias in the LETOR datasets
نویسندگان
چکیده
The LETOR datasets consist of data extracted from traditional IR test corpora. For each of a number of test topics, a set of documents has been extracted, in the form of features of each document-query pair, for use by a ranker. An examination of the ways in which documents were selected for each topic shows that the selection has (for each of the three corpora) a particular bias or skewness. This has some unexpected effects which may considerably influence any learning-to-rank exercise conducted on these datasets. The problems may be resolvable by modifying the datasets.
منابع مشابه
How to Make LETOR More Useful and Reliable
Learning to rank has attracted great attention recently in both information retrieval and machine learning communities. However, the lack of public dataset had stood in its way until the LETOR benchmark dataset (actually a group of three datasets) was released in the SIGIR 2007 workshop on Learning to Rank for Information Retrieval (LR4IR 2007). Since then, this dataset has been widely used in ...
متن کاملIs learning to rank effective for Web search?
LETOR, the benchmark collection for learning to rank, helps make comparative study on different approaches in experimental research. Since the collection is constructed mainly based on TREC datasets, queries and documents in LETOR differ from true Web search scenario on some aspects, such as its incomplete link information, limited documents’ domain, and lack of user click information. Hence th...
متن کاملLETOR: A Benchmark Collection LETOR: A Benchmark Collection for Learning to Rank for Information Retrieval
Learning to rank has attracted great attention recently in both information retrieval and machine learning communities. However, the lack of public dataset had stood in its way until the LETOR benchmark dataset (actually a group of three datasets) was released in the SIGIR 2007 workshop on Learning to Rank for Information Retrieval (LR4IR 2007). Since then, this dataset has been widely used in ...
متن کاملFast SFFS-Based Algorithm for Feature Selection in Biomedical Datasets
Biomedical datasets usually include a large number of features relative to the number of samples. However, some data dimensions may be less relevant or even irrelevant to the output class. Selection of an optimal subset of features is critical, not only to reduce the processing cost but also to improve the classification results. To this end, this paper presents a hybrid method of filter and wr...
متن کاملLearning Preferences with Hidden Common Cause Relations
Gaussian processes have successfully been used to learn preferences among entities as they provide nonparametric Bayesian approaches for model selection and probabilistic inference. For many entities encountered in real-world applications, however, there are complex relations between them. In this paper, we present a preference model which incorporates information on relations among entities. S...
متن کامل